Fix TEGroupedLinear quantization for expert parallelism (EP > 1) (#833)
**Walkthrough:** The changes refactor Mixture-of-Experts (MoE) calibration handling in PyTorch quantization across three modules: they add explicit MoE calibration validation and local expert amax synchronization in `model_calib.py`, remove the specialized `_QuantMoELayer` class from `megatron.py`, and improve argument-parsing robustness in `transformer_engine.py`'s grouped linear quantization path for varying input configurations.
**Codecov Report:** ✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #833 +/- ##
========================================
Coverage 73.72% 73.72%
========================================
Files 196 197 +1
Lines 20457 20625 +168
========================================
+ Hits 15082 15206 +124
- Misses 5375 5419 +44
## What does this PR do?
**Type of change:** Bug fix / Compatibility update
**Overview:**
Fix `te_grouped_quantized_linear_fn` argument parsing for
TEGroupedLinear quantization when parallelism configuration results in
fewer local experts per GPU.
### Problem
TransformerEngine changed the `_GroupedLinear.forward` signature in PR #2377 (released in TE 2.10):
- Old signature (TE < 2.10): `forward(ctx, inp, m_splits: List[int], use_bias, is_first_microbatch, ...)`
- New signature (TE >= 2.10): `forward(ctx, inp, non_tensor_args: Tuple, *weights_and_biases)`, where `non_tensor_args = (m_splits, use_bias, is_first_microbatch, ...)`

Without this fix, ModelOpt's quantization code fails with newer TE versions: it tries to read `m_splits` directly from `args[idx + 1]`, but in TE >= 2.10 that position holds the `non_tensor_args` tuple instead.
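As a toy illustration (not the real TE API; the values and argument names here are made up), the same positional slot holds a different payload across versions, which is why a fixed `args[idx + 1]` read breaks:

```python
# Hypothetical sketch of the two argument layouts described above.
m_splits = [128, 256, 64]  # per-expert token counts (3 local experts)

# TE < 2.10: m_splits sits directly in its positional slot.
old_args = ("inp", m_splits, True, None)
# TE >= 2.10: all non-tensor args are packed into one tuple in that slot.
new_args = ("inp", (m_splits, True, None))

idx = 0  # position of `inp` in this toy example
assert old_args[idx + 1] == m_splits      # old: m_splits directly
assert new_args[idx + 1] != m_splits      # new: that slot is the packed tuple
assert new_args[idx + 1][0] == m_splits   # new: m_splits is its first element
```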
### Root Cause
The code assumed `m_splits` was always directly accessible at `args[idx + 1]`, but TransformerEngine PR #2377 changed the signature to pack all non-tensor arguments into a tuple. Qwen3-30B-A3B (with `num_gemms=21`, threshold=44) is one configuration that triggers the failure.
### Solution
Added version checking to handle both signatures:
```python
if Version("2.10") <= _TE_VERSION:
# New signature: non_tensor_args is a tuple, m_splits is the first element
num_gemms = len(args[idx + 1][0])
else:
# Old signature: m_splits is directly args[idx + 1]
num_gemms = len(args[idx + 1])
```
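The same version check can be packaged as a small standalone helper. This is a sketch for illustration: `infer_num_gemms`, the `te_version` parameter, and the argument layout are assumptions, not ModelOpt's actual code.

```python
from packaging.version import Version


def infer_num_gemms(args: tuple, idx: int, te_version: Version) -> int:
    """Recover num_gemms (= len(m_splits)) from forward() args,
    handling both the pre- and post-2.10 TE signatures."""
    if Version("2.10") <= te_version:
        # TE >= 2.10: args[idx + 1] is the packed non_tensor_args tuple,
        # whose first element is m_splits.
        return len(args[idx + 1][0])
    # TE < 2.10: args[idx + 1] is m_splits itself.
    return len(args[idx + 1])
```

For example, `infer_num_gemms(("inp", [128, 256, 64], True, None), 0, Version("2.9"))` and `infer_num_gemms(("inp", ([128, 256, 64], True, None)), 0, Version("2.10"))` both resolve to 3 local experts.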
## Usage
<!-- You can potentially add a usage example below. -->
Works with TransformerEngine versions both before and after 2.10:
```shell
# High EP quantization - previously failed, now works
torchrun --nproc_per_node 8 examples/quantization/quantize.py \
--hf-model-id /models/Qwen3-30B-A3B \
--export-quant-cfg fp8 \
--megatron-save-path /models/Qwen3-30B-A3B_fp8_mlm \
--tp 8 \
--ep 8
# High EP inference - previously failed, now works
torchrun --nproc_per_node 8 examples/quantization/ptq_generate.py \
--megatron-load-path /models/Qwen3-30B-A3B_fp8_mlm \
--hf-model-id /models/Qwen3-30B-A3B \
--tp 8 \
--ep 8
```
## Testing
<!-- Mention how have you tested your change if applicable. -->
```shell
# High EP quantization - previously failed, now works
torchrun --nproc_per_node 8 examples/quantization/quantize.py \
--hf-model-id /models/Qwen3-30B-A3B \
--export-quant-cfg fp8 \
--megatron-save-path /models/Qwen3-30B-A3B_fp8_mlm \
--tp 8 \
--ep 8
# High EP inference - previously failed, now works
torchrun --nproc_per_node 8 examples/quantization/ptq_generate.py \
--megatron-load-path /models/Qwen3-30B-A3B_fp8_mlm \
--hf-model-id /models/Qwen3-30B-A3B \
--tp 8 \
--ep 8
```
## Before your PR is "*Ready for review*"
<!-- If you haven't finished some of the above items you can still open
`Draft` PR. -->
- **Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)**
and your commits are signed.
- **Is this change backward compatible?**: Yes/No <!--- If No, explain
why. -->
- **Did you write any new necessary tests?**: Yes/No
- **Did you add or update any necessary documentation?**: Yes/No
- **Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**:
Yes/No <!--- Only for new features, API changes, critical bug fixes or
bw breaking changes. -->
## Additional Information
<!-- E.g. related issue. -->
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Bug Fixes**
* Enhanced Mixture of Experts (MoE) calibration validation and
synchronization to ensure consistency across distributed training
setups.
* Improved grouped linear quantization robustness to handle varying
input patterns and tensor dimensions.
* **Improvements**
* Better error handling for incomplete MoE expert calibration detection.
* More flexible argument parsing for quantization operations.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: James Shen <yueshen@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>